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Abstract.  It  has  been  observed  by  many  researchers  that 
the  protein-coding  regions  of  DNA  sequences  exhibit  a 
period-3  behavior  due  to  codon  structure.  Identification 
of  the  period-3  regions  helps  in  predicting  the  gene  loca¬ 
tions,  and  in  fact  allows  the  prediction  of  specific  exons 
within  the  genes  of  eucaryotic  cells.  Traditionally  these 
regions  are  identified  with  the  help  of  techniques  such  as 
the  windowed  DFT.  In  this  paper  we  consider  the  use  of 
efficient  digital  filters  for  the  same  purpose.  The  filters  can 
be  designed  not  only  to  extract  the  period-3  component, 
but  at  the  same  time  effectively  eliminate  the  background 
1  //  spectrum  exhibited  by  nearly  all  DNA  sequences.1 

I.  INTRODUCTION 

It  is  well-known  that  base  sequences  in  the  protein-coding 
regions  of  DNA  molecules  exhibit  a  period-3  pattern  be¬ 
cause  of  the  codon  structure  involved  in  the  translation 
of  base  sequences  into  amino  acids  [11],  [12].  For  eucary¬ 
otes  (cells  with  nucleus)  this  periodicity  has  mostly  been 
observed  within  the  exons  (coding  subregions  inside  the 
genes  [1])  and  not  within  the  introns  (noncoding  subre¬ 
gions  in  the  genes).  There  are  theories  explaining  the  rea¬ 
son  for  such  periodicity,  but  there  are  also  exceptions  to 
the  phenomenon.  Nevertheless,  many  researchers  have  re¬ 
garded  the  period-3  property  to  be  a  good  (preliminary) 
indicator  of  gene  location.  Techniques  which  exploit  this 
property  for  gene  prediction  proceed  by  computing  the  dis¬ 
crete  Fourier  transform  (DFT)  with  a  sliding  window.  This 
is  expected  to  exhibit  a  peak  at  the  frequency  27r/3  due 
to  the  periodicity.  This  technique  has  successfully  been 
used  to  identify  exons  within  the  genes  of  eucaryotic  cells 
[2,  11].  The  periodic  behavior  indicates  strong  short-term 
correlation  in  the  coding  regions,  in  addition  to  the  long- 
range  correlation  or  l//-like  behavior  exhibited  by  DNA 
sequences  of  many  organism  in  general  [6,9,15]. 

Digital  signal  processing  techniques  offer  more  efficient 
ways  to  identify  regions  of  the  DNA  exhibiting  periodic 
behavior.  Such  methods  have  typically  not  been  used  in 
the  biotechnology  community.  For  example,  digital  band¬ 
pass  filters  with  a  narrow  passband  are  often  very  effective 
in  extracting  the  period-3  information  and  attenuating  the 
1  //  behavior.  In  this  paper  we  describe  a  number  of  meth¬ 
ods  to  obtain  such  filters,  and  examine  their  performance 
on  DNA  sequences  from  the  genome  database. 


1Work  supported  in  part  by  the  ONE  grant  NOOO 14-99-1- 
1002,  USA. 
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II.  PERIODICITY  IN  CODING  REGIONS 

Figure  1(a)  demonstrates  a  simple  schematic  for  part  of  a 
DNA  molecule  [1],  with  the  double  helix  straightened  out 
for  convenience.  The  four  bases  or  nucleotides  attached 
to  the  sugar  phosphate  backbone  are  denoted  with  the 
usual  letters  A,  C,  G,  and  T.  The  forward  genome  se¬ 
quence  . . .  ATTCATAGT . . .  corresponds  to  the  upper 
strand  of  the  DNA  molecule.  Note  that  the  ordering  is 
from  the  so-called  5'  to  the  3'  end  (left  to  right). 
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Figure  1.  (a)  The  DNA  double  helix  (linearized  schematic), 
and  (b)  various  regions  in  a  DNA  molecule. 

As  shown  in  Figure  1(b),  a  DNA  sequence  can  be  divided 
into  genes  and  intergenic  spaces.  The  genes  are  responsible 
for  protein  synthesis.  A  gene  can  be  divided  into  two  sub- 
regions  called  the  exons  and  introns.  (Procaryotes,  which 
are  cells  without  a  nucleus,  do  not  have  introns).  Only  the 
exons  are  involved  in  protein-coding.  The  bases  in  the  exon 
region  can  be  imagined  to  be  divided  into  groups  of  three 
adjacent  bases.  Each  triplet  is  called  a  codon.  Scanning  the 
gene  from  left  to  right,  a  codon  sequence  can  be  defined  by 
concatenation  of  the  codons  in  all  the  exons.  Each  codon 
(except  the  so-called  stop  codon)  instructs  the  cell  ma¬ 
chinery  to  synthesize  an  amino  acid.  The  codon  sequence 
therefore  uniquely  identifies  an  amino  acid  sequence  which 
defines  a  protein.  Since  there  are  64  possible  codons  but 
only  20  amino  acids,  the  mapping  from  codons  to  amino 
acids  is  many-to-one.  The  introns  do  not  participate  in  the 
protein  synthesis. 
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It  has  been  observed  more  than  two  decades  ago  [12] 
that  the  base  sequence  in  the  coding  regions  (exons)  have 
a  strong  period-3  component.  Some  authors  have  claimed 
that  this  is  due  to  nonuniform  codon  usage:  even  though 
there  are  several  codons  which  could  code  a  given  amino 
acid,  they  are  not  used  with  uniform  probability,  and  this 
creates  a  codon  bias.  There  is  an  excess  of  guanine  (G)  in 
position  1,  leading  to  strong  period  3  oscillation  [5].  The 
work  by  Tiwari,  et  al.  [11]  seems  to  indicate  that  this  expla¬ 
nation  is  not  complete.  Indeed,  these  authors  “synthesize 
genes”  by  starting  from  proteins  and  mapping  aminoacids 
back  to  codons.  In  this  reverse  mapping  process,  they 
assign  “uniform  probability”  to  the  different  codons  that 
might  lead  to  a  given  amino  acid.  The  resulting  pseudo 
gene  has  been  found  to  retain  the  period  3  property! 

III.  DNA  SPECTRUM  VERSUS  DNA  FILTERING 

To  perform  gene  prediction  based  on  the  period-3  prop¬ 
erty,  one  defines  indicator  sequences  for  the  four  bases  and 
computes  the  DFTs  of  short  segments  of  these.  Given  a 
DNA  sequence,  the  indicator  sequence  for  the  base  A  is  a 
binary  sequence,  e.g., 

xA(n)  =  000110111000101010 . . . 

where  1  indicates  the  presence  of  an  A  and  0  indicates  its 
absence.  The  indicator  sequences  for  the  other  bases  are 
defined  similarly.  It  is  clear  that  the  sequence  111111 ... 
is  obtained  by  adding  the  four  indicator  sequences.  The 
DFT  of  a  length- AT  block  of  XA( n)  is  defined  as 

N- 1 

XA[k]  =  J2  xA{n)e~^kn'N ,  0  <  Jfc  <  N  -  1, 

n=0 

where  we  have  assigned  the  number  n  =  0  to  the  begin¬ 
ning  of  the  block.  The  DFTs  Xr[k],Xc[k],  and  Xc[k] 
are  defined  similarly.  The  period-3  property  of  a  DNA  se¬ 
quence  implies  that  the  DFT  coefficients  corresponding  to 
k  =  N/ 3  are  large.  Thus  if  we  take  TV  to  be  a  multiple  of 
3  and  plot 

S[k]  =  |X4fc]|2  +  |XT[fc]|2  +  |Xc[fc]|2  +  \XG[k}\2  (1) 

then  we  should  see  a  peak  at  the  sample  value  k  =  N/3 
as  demonstrated  in  many  papers  (e.g.,  [11]).  While  this  is 
generally  true,  the  strength  of  the  peak  depends  markedly 
on  the  gene.  It  is  sometimes  very  pronounced,  sometimes 
quite  weak.  Notice  that  a  calculation  of  the  DFT  at  the 
single  point  k  =  N/3  is  sufficient.  The  window  can  then 
be  slided  by  one  or  more  bases  and  5[AT/3]  recalculated. 
Thus,  we  get  a  picture  of  how  5[A^/3j  evolves  along  the 
length  of  the  DNA  sequence.  It  is  necessary  that  the  win¬ 
dow  length  N  be  sufficiently  large  (typical  window  sizes  are 
a  few  hundreds,  eg.,  351,  to  a  few  thousands)  so  that  the 
periodicity  effect  dominates  the  background  1/f  spectrum 
which  makes  its  strong  presence  in  DNA  sequences  [9],  [15]. 
However  a  long  window  implies  longer  computation  time, 
and  also  compromises  the  base-domain  resolution  in  pre¬ 
dicting  the  exon  location. 

Digital  filtering  method .  The  sliding  window  method 
can  be  regarded  as  digital  filtering  followed  by  a  decimator 


which  depends  on  the  separation  between  adjacent  posi¬ 
tions  of  the  window  [3,  13].  The  filter  itself  has  a  very 
simple  impulse  response 

w{n)  =  !  e^on  0  <  n  <  N  -  1 
[  0  otherwise. 

This  is  a  bandpass  filter  with  passband  centered  at  l?o  = 
2n/3  and  minimum  stopband  attenuation  of  about  13  dB 
(Fig.  2).  This  tells  us  that  if  we  pay  more  careful  at¬ 
tention  to  the  design  of  the  digital  filter,  we  can  isolate 
the  period-3  behavior  from  background  information  such 
as  1/f  noise  more  effectively.  We  can  also  use  efficient 
methods  to  design  and  implement  the  filter,  thereby  re¬ 
ducing  computational  complexity. 


Figure  2.  The  filtering  effect  of  DFT  computation. 


Figure  3.  A  digital  filter  H(z)  with  indicator  sequence 
Xq(u)  as  its  input. 

Consider  a  narrow  band  bandpass  digital  filter  H (z)  with 
passband  centered  at  u>o  =  2ir/3 .  With  the  indicator  se¬ 
quence  Xc(n)  taken  as  input,  let  f/o(n)  denote  its  output. 
Note  that  n  should  be  interpreted  as  base  location.  In  the 
coding  regions,  the  sequence  Xg(ti)  is  expected  to  have  a 
period-3  component,  which  means  that  it  has  large  energy 
in  the  filter  passband.  So  we  expect  the  output  2/g(^)  1° 
be  relatively  large  in  the  coding  regions  as  demonstrated 
in  Fig.  3.  With  similar  notation  for  the  other  bases,  define 

r[n]  =  |?M(n)|2  +  Iyr(n)|2  +  \yc(n)\2  +  |yG(n)| 2 

A  plot  of  this  function  can  be  used  as  a  preliminary  indi¬ 
cator  of  coding  regions.  The  narrow  band  filter  H(z)  can 
be  regarded  as  an  antinotch  filter  (i.e.,  complement  of  a 
notch).  We  now  describe  some  efficient  ways  to  design  and 
implement  such  filters. 

IV.  HR  ANTINOTCH  FILTERS 

The  use  of  HR  antinotch  filters  for  gene  prediction  was  pro¬ 
posed  in  [14].  Such  IIR  filters  can  be  obtained  by  starting 


from  a  second  order  allpass  filter 


R2  —  2Rcos6z  1  4  z  2 
A{Z>  =  1-  2Rcos0z~l  TWF* 


we  can  adjust  R2  depending  on  the  base-domain  resolution 
desired. 


which  has  poles  at  Re^&  and  zeros  at  1  /Re^G.  Thus, 
consider  a  filter  bank  with  two  filters  G(z)  and  H ( Z )  de¬ 
fined  according  to 


\G(z)  1 

1  [i  i  ]  r  i 

2  [l  -lj  A{z) 

(2) 


Then  G(z)  has  the  form 


G(z) 


K 


1  —  2cosu/o  z  1  4-  z  2 
1  —  2Rcos8z~l  4  R2z~2 


where 


2  R  cos  0 


“  T TW 

This  shows  that  G(z)  is  a  notch  filter  [10]  with  a  zero  at  the 
frequency  Wo-  When  the  pole  radius  R  is  close  to  the  unit 
circle  we  see  that  Wo  gets  close  to  6.  That  is,  the  pole  and 
zero  of  the  filter  G(z)  are  very  close  to  each  other.  Thus,  at 
frequencies  sufficiently  away  from  Wo,  the  response  is  close 
to  unity.  This  is  demonstrated  in  Fig.  4,  which  shows  the 
magnitude  response  of  G(z)  for  two  values  of  R.  From  Eq. 
(2)  we  see  that 


'G(ejuJ)~ 

U 

1 

~V2 

_A(eju)  _ 

where  U  is  unitary,  that  is,  UeU  =  I.  This  shows  that 


\G(en\2  +  \H{e*“)\2 


1  +  \A(eju>)\2 
2 


where  we  have  used  the  property  |<4(eJU,)|  =  1.  It  there¬ 
fore  follows  that  G(z)  and  H(z)  are  power  complementary. 
This  shows,  in  particular,  that  the  filter  H(z)  is  a  good 
antinotch  filter  as  demonstrated  in  Fig.  5,  for  the  same 
pole  radii  chosen  in  Fig.  4. 

By  choosing  Wo  =  27t/3  the  filter  H(z)  can  be  used  to 
extract  the  period-3  regions  of  the  DNA  effectively.  The 
allpass  filter  A(z)  can  be  implemented  with  either  the  di¬ 
rect  form  structure  or  the  cascaded  lattice  structure  [8], 
[13].  The  lattice  structure  with  one-multiplier  sections  [13] 
is  especially  attractive  [10],  and  Fig.  6  shows  the  imple¬ 
mentation  of  H(z)  using  this  lattice.  The  multipliers  in 
this  structure  are  the  lattice  coefficients 


quality  control  frequency  control 

Figure  6.  Lattice  structure  for  implementing  the  antinotch 
filter  H(z)  =  V{z)/X(z). 

V.  MULTISTAGE  FILTERS 


hi  =  i?2,  &2  —  —  COSO^o* 

Since  the  antinotch  frequency  is  u>o  =  27t/3  we  have 

k2  ~  —  cos  wo  =  1/2 

which  can  be  implemented  with  a  binary  shift.  So  the 
only  significant  multiplier  is  i?2,  and  controls  the  antinotch 
quality  without  affecting  the  frequency  Wq  (Fig.  5).  Thus 


Even  though  the  HR  antinotch  method  has  been  found  to 
work  well  [14],  there  is  room  for  improvement.  We  will 
show  that  with  a  slight  increase  in  the  number  of  multipli¬ 
ers  we  can  design  filters  with  much  better  stopband  atten¬ 
uation.  Such  filters  are  essential  in  order  to  suppress  the 
background  1/f  noise  which  is  always  there  in  the  DNAs 
of  many  organisms,  due  to  long-range  correlation  between 
base  pairs. 
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Figure  7.  The  multistage  design  of  a  narrowband  band¬ 
pass  filter,  (a)  Magnitude  response  of  lowpass  prototype 
Hi(z),  (b)  multiband  response  of  H\(z 3),  (c)  the  re¬ 
sponse  of  #2(2)  which  eliminates  the  unwanted  passband 
at  u  =  0. 


The  method  to  be  presented  is  based  on  the  idea  of  multi¬ 
stage  filtering  [3,13].  To  explain  this  consider  a  narrowband 
lowpass  filter  H\(z)  as  shown  in  Fig.  7(a).  If  we  replace 
each  delay  element  z"1  in  the  filter  with  z~ 3,  we  get  the 
filter  H\(z3)  whose  response  is  as  shown  in  Fig.  7(b). 
Thus,  there  is  a  passband  centered  at  27t/3  and  a  pass- 
band  at  u  =  0.  If  we  now  cascade  this  with  a  filter  #2(2) 
which  attenuates  the  zero-frequency  passband  severely,  the 
resulting  filter 


H(z)  =  HjCsVaW 

is  a  narrowband  filter  with  passband  centered  at  27t/3.  We 
will  demonstrate  that  H\{z)  and  #2(2)  can  be  designed 
with  very  low  complexity,  and  that  the  filter  predicts  the 
exons  with  good  accuracy.  The  multistage  idea  is  similar 
in  principle  to  the  IFIR  method  introduced  by  Neuvo,  et 
al.  [7,13]. 

Figure  8  shows  an  example.  Here  H\(z)  is  a  third  order 
elliptic  filter  and  #2(2)  is  chosen  to  have  two  zeros  at 
w  =  0,  that  is, 


H2(z)  =  (1-z~1)2. 


The  various  filter  responses  involved  in  the  multistage  de¬ 
sign  are  shown  in  the  figure.  The  bottom  plot  shows  the 
multistage  filter  H(z)  which  has  a  narrow  passband  at 
U  =  27t/3,  and  excellent  attenuation  at  most  frequencies. 
Implemented  in  direct  form  [8],  H\(z)  requires  5  multipli¬ 
ers,  and  7/2(2)  is  multiplierless. 

It  should  be  noticed  here  that  H\(z)  can  be  imple¬ 
mented  using  the  allpass  decomposition  method  [13],  which 
allows  the  third  order  elliptic  filter  to  be  written  in  the  form 

t,  A0(z)  +  A1(z) 

Hi{z)  = - ^ - 

where  Aq(z)  is  a  first  order  allpass  filter  and  A\(z)  a  sec¬ 
ond  order  allpass  filter,  both  with  real  coefficients.  We 
can  implement  A${z)  and  A\{z)  with  one  and  two  multi¬ 
pliers  respectively  [13],  so  that  Hi(z)  requires  only  three 
multipliers.  Summarizing,  the  multistage  method  has  only 
slightly  higher  complexity  than  the  allpass-based  antinotch 
filter,  but  its  characteristics  are  significantly  better. 


Figure  8.  Magnitude  responses  of  filters  in  the  multistage 
design  method.  From  top  to  bottom:  the  HR  lowpass  filter 
Hi(z),  the  expanded  version  7/i(z3),  the  FIR  filter  7/2(2), 
and  the  multistage  filter  7/i(z3)  7/2(2). 
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VI.  EXAMPLES  AND  CONCLUSIONS 

We  show  in  Fig.  9  the  exon  prediction  results  for  gene 
F56F11.4  in  the  C-elegans  chromosome  III.  This  gene  has 
five  exons.  The  first  plot  uses  the  DFT  based  spectrum 
described  in  Sec.  III.  The  five  peaks  corresponding  to  the 
exons  can  be  seen  clearly.  The  middle  plot  uses  the  allpass- 
based  antinotch  filter  with  pole  radius  R  =  0.992.  This 
scheme  can  be  implemented  with  only  one  multiplier  per 
output  sample  (i.e.,  per  base  pair).  Both  of  these  methods 
locate  the  five  exons  quite  well,  but  we  also  notice  the 
background  “noise”  due  to  the  1/f  characteristics  in  DNA 
sequences.  The  third  plot  uses  the  multistage  filter  H(z) 
shown  in  the  bottom  of  Fig.  8.  Notice  that  the  background 
noise  has  been  removed  almost  completely  and  the  five 
exons  can  be  seen  clearly. 


relative  base  location  n 


Figure  9.  Top  plot:  the  DFT  based  spectrum  S[N/ 3]  for 
gene  F56F11.4  in  the  C-elegans  chromosome  III.  Middle 
plot:  the  antinotch  filter  output  (Sec.  IV)  for  the  same 
gene.  Bottom  plot:  the  multistage  narrowband  bandpass 
filter  output  (Sec.  V)  for  the  same  gene. 


As  explained  in  detail  in  [4],  gene  identification  is  a  very* 
complex  problem,  and  the  identification  of  period-3  regions 
is  only  a  step  towards  gene  and  exon  identification.  In  fact, 
Tiwari,  et  al.  [11]  have  observed  that  some  genes  do  not 
exhibit  period-3  behavior  at  all  in  S.  cerevisiae  (e.g.,  genes 
of  the  mating  type  locus).  The  period-3  property  has  often 
been  attributed  to  the  dominance  of  the  base  G  at  certain 
codon  positions  in  the  coding  regions.  We  have,  in  fact, 
observed  experimentally  that  the  use  of  the  base  G  alone, 
instead  of  all  four  bases,  often  leads  to  excellent  prediction 
of  period-3  regions. 
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